Machine Learning Notes
Table of Contents
Statistical Learning Methods
Introduction to Statistical Learning Methods
Basic Concepts
The Three Elements of Statistical Learning
Model Evaluation and Model Selection
Machine Learning
Loading Datasets
Loading Data
Splitting the Dataset
Learning and Predicting
Example
Saving a Model
Nearest Neighbors
Plotting the Decision Boundary
Model Selection: Choosing the Model and Its Parameters
The Curse of Dimensionality
Perceptron
K-Nearest Neighbors
Linear Models
Logistic Regression (LogisticRegression)
AdaBoost
Support Vector Machines (SVM)
Clustering
Decompositions: Dimensionality Reduction
Pipelining
Type Casting
1. Unless otherwise specified, input is cast to float64
2. Regression targets are cast to float64; classification targets are preserved
Numpy
1. array vs list
2. np.unique(iris_y)
Pandas
Example
Working With Text Data - 20 newsgroups dataset
Statistical Learning Methods
Introduction to Statistical Learning Methods
Basic Concepts
The data are assumed to be generated i.i.d. from a joint probability distribution P(X, Y). That X and Y have a joint probability distribution is the basic assumption of supervised learning.
A model in the hypothesis space can be represented in two ways: as a decision function Y = f(X) or as a conditional probability distribution P(Y|X).
The Three Elements of Statistical Learning
Method = Model + Strategy + Algorithm
Model
The hypothesis space can be defined as a set of decision functions F = {f | Y = f_θ(X), θ ∈ Θ}, or as a set of conditional probability distributions F = {P | P_θ(Y|X), θ ∈ Θ}, where the parameter vector θ takes values in an n-dimensional parameter space Θ.
Strategy
The loss function is usually written L(Y, f(X)); the common loss functions are:
0-1 loss: L(Y, f(X)) = 1 if Y ≠ f(X), and 0 if Y = f(X)
Squared loss: L(Y, f(X)) = (Y − f(X))^2
Absolute loss: L(Y, f(X)) = |Y − f(X)|
Log loss (log-likelihood loss): L(Y, P(Y|X)) = −log P(Y|X)
Exponential loss: L(y, f(x)) = exp(−y · f(x))
(Comparison figure of the loss functions omitted.)
Expected risk: R_exp(f) = E_P[L(Y, f(X))] = ∫ L(y, f(x)) P(x, y) dx dy
Empirical risk: R_emp(f) = (1/N) Σ_{i=1}^{N} L(yi, f(xi))
Structural risk: R_srm(f) = (1/N) Σ_{i=1}^{N} L(yi, f(xi)) + λ · J(f)
Structural risk minimization was proposed to prevent overfitting and is equivalent to regularization. J(f) measures the complexity of the model: the more complex the model, the larger J(f); the simpler the model, the smaller J(f). The maximum a posteriori (MAP) estimate in Bayesian estimation is an example of structural risk minimization.
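A tiny numeric sketch of structural risk: the empirical risk (mean squared loss here) plus a complexity penalty λ·J(f), with J(f) taken as the squared norm of the coefficients as in ridge regression; all numbers below are made up for illustration.
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
coef = np.array([0.5, -0.3])   # hypothetical model coefficients
lam = 0.1                      # regularization strength lambda
empirical_risk = np.mean((y_true - y_pred) ** 2)
structural_risk = empirical_risk + lam * np.sum(coef ** 2)
print(empirical_risk, structural_risk)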
Algorithm
According to the chosen strategy, select the optimal model and its parameters from the hypothesis space.
Model Evaluation and Model Selection
Training error and test error
The error is computed with a specific loss function, e.g. the 0-1 loss or the squared loss; see the sketch below.
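A small sketch of training error versus test error under the 0-1 loss; the iris split and the k-nearest-neighbour model are arbitrary illustration choices.
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
rng = np.random.RandomState(0)
idx = rng.permutation(len(iris.data))
X_train, y_train = iris.data[idx[:-30]], iris.target[idx[:-30]]
X_test, y_test = iris.data[idx[-30:]], iris.target[idx[-30:]]

clf = KNeighborsClassifier().fit(X_train, y_train)
train_error = np.mean(clf.predict(X_train) != y_train)  # empirical risk under the 0-1 loss
test_error = np.mean(clf.predict(X_test) != y_test)     # estimate of the expected risk
print(train_error, test_error)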
Machine Learning
Loading Datasets
Loading Data
from sklearn import datasets
iris = datasets.load_iris()
Splitting the Dataset
Random permutation
import numpy as np
iris_X = iris.data
iris_y = iris.target
np.random.seed(0)
indices = np.random.permutation(len(iris_X))  # a random permutation of the sample indices
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]
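The same split can also be done with the train_test_split helper; a sketch, assuming the old sklearn.cross_validation module used elsewhere in these notes (newer versions move it to sklearn.model_selection), with test_size=10 chosen to match the manual split above.
from sklearn.cross_validation import train_test_split
iris_X_train, iris_X_test, iris_y_train, iris_y_test = train_test_split(
    iris_X, iris_y, test_size=10, random_state=0)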
Learning and Predicting
Example
from sklearn import datasets, svm
digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100)
clf.fit(digits.data[:-1], digits.target[:-1])
print(clf)
print(clf.predict(digits.data[-1:]))
Updating parameters (sklearn.pipeline.Pipeline.set_params)
clf.set_params(kernel='linear').fit(X, y)
clf.set_params(kernel='rbf').fit(X, y)
Saving a Model
import pickle
s = pickle.dumps(clf)
clf2 = pickle.loads(s)
print(digits.target[-2])
print(clf2.predict(digits.data[-2:-1]))
# 2
from sklearn.externals import joblib
joblib.dump(clf, 'filename.pkl')
clf = joblib.load('filename.pkl')
Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier()
>>> knn.fit(iris_X_train, iris_y_train)
Plotting the Decision Boundary
# setup following the scikit-learn nearest-neighbors plotting example: iris, first two features
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets, neighbors

n_neighbors = 15
iris = datasets.load_iris()
X = iris.data[:, :2]  # keep only the first two features so the boundary can be drawn in 2D
y = iris.target

# color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
for weights in ['uniform', 'distance']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
    clf.fit(X, y)
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    h = .02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    print(xx)
    # Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # equivalent to the line below
    Z = clf.predict(np.array([xx.ravel(), yy.ravel()]).T)
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    print(xx.shape)
    plt.figure()
    # fill the mesh with the predicted class colors
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
    # plt.contourf(xx, yy, Z, alpha=0.4, cmap=cmap_light)
    # plot the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')" % (n_neighbors, weights))
plt.show()
Model Selection: Choosing the Model and Its Parameters
Score, and cross-validated scores
Every estimator exposes a score method that judges the quality of the fit (or the prediction) on new data: bigger is better.
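A minimal sketch of the score method on the digits data; the names X_digits, y_digits and svc defined here are reused by the cross-validation snippets below, and C=1 with a linear kernel is just one reasonable choice.
from sklearn import datasets, svm

digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
svc = svm.SVC(C=1, kernel='linear')
# fit on all but the last 100 samples, score on the held-out 100; bigger is better
print(svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:]))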
Cross-validation
from sklearn import cross_validation
kfold = cross_validation.KFold(len(X_digits), n_folds=3)
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test]) for train, test in kfold]
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]
>>> cross_validation.cross_val_score(svc, X_digits, y_digits, cv=kfold, n_jobs=-1)
array([ 0.93489149, 0.95659432, 0.93989983])
Cross-validation generators
KFold(n, k) | Split into K folds, train on K-1 of them, then test on the left-out fold
StratifiedKFold(y, k) | Preserves the class ratios / label distribution within each fold
LeaveOneOut(n) | Leave one observation out
LeaveOneLabelOut(labels) | Takes a label array to group observations
Grid-search and cross-validated estimators
import numpy as np
from sklearn.grid_search import GridSearchCV
Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...
By default GridSearchCV uses 3-fold cross-validation. If the estimator is a classifier rather than a regressor, it uses StratifiedKFold so that each fold keeps the same label proportions.
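As a sketch, the grid-searched estimator can itself be evaluated with an outer cross-validation loop (nested cross-validation); clf here is the GridSearchCV object fitted above.
print(cross_validation.cross_val_score(clf, X_digits, y_digits))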
Cross-validated estimators
from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()
diabetes = datasets.load_diabetes()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)
>>> # The estimator chose automatically its lambda:
>>> lasso.alpha_
0.01229...
The Curse of Dimensionality
First, Error = Bias + Variance (more precisely, the expected squared error decomposes into bias², variance, and irreducible noise).
Error reflects the overall accuracy of the model. Bias is the gap between the model's outputs on the samples and the true values, i.e. the intrinsic accuracy of the model. Variance is the spread between each output of the model and the expected output, i.e. the stability of the model.
A rough rule of thumb: N = 10 · d training samples (d = number of dimensions).
A strict requirement: N = 10^d samples (e.g. d = 10 already demands 10^10 samples).
Perceptron
Perceptron Learning Algorithm
The relationship between PLA and SGD:
Perceptron and SGDClassifier share the same underlying implementation. In fact, Perceptron() is equivalent to SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None).
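A small check of the quoted equivalence; a sketch, assuming two-class iris data and a fixed random_state. Since default settings differ slightly across scikit-learn versions, the learned coefficients are printed for comparison rather than asserted equal.
from sklearn import datasets
from sklearn.linear_model import Perceptron, SGDClassifier

iris = datasets.load_iris()
X, y = iris.data[iris.target < 2], iris.target[iris.target < 2]

pla = Perceptron(random_state=0).fit(X, y)
sgd = SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant",
                    penalty=None, random_state=0).fit(X, y)
print(pla.coef_)
print(sgd.coef_)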
K-Nearest Neighbors
- There is no explicit learning step; classification is done by majority voting among the k nearest neighbours.
- The model amounts to a partition of the feature-vector space induced by the training data.
- Three elements: the distance metric, the choice of k, and the classification (decision) rule.
- The larger k is, the simpler the model; the smaller k is, the more complex the model and the easier it is to overfit.
- The majority-voting rule is equivalent to empirical risk minimization.
- A linear scan is too slow; a kd-tree is used to speed up the search (some data-structure background required); see the sketch after this list.
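A minimal sketch of switching the neighbour search to a kd-tree in scikit-learn; the names iris_X_train, iris_y_train, iris_X_test reuse the split from the earlier section, and n_neighbors=5 is an arbitrary choice.
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')  # other options: 'auto', 'ball_tree', 'brute'
knn.fit(iris_X_train, iris_y_train)
print(knn.predict(iris_X_test))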
Linear Models
from sklearn import datasets, linear_model
diabetes = datasets.load_diabetes()
diabetes_X_train = diabetes.data[:-20]
diabetes_X_test = diabetes.data[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
regr = linear_model.LinearRegression()
>>> regr.fit(diabetes_X_train, diabetes_y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
>>> print(regr.coef_)
[ 0.30349955 -237.63931533 510.53060544 327.73698041 -814.13170937
492.81458798 102.84845219 184.60648906 743.51961675 76.09517222]
>>> # The mean squared error
>>> np.mean((regr.predict(diabetes_X_test)-diabetes_y_test)**2)
2004.56760268...
>>> # Explained variance score: 1 is perfect prediction
>>> # and 0 means that there is no linear relationship
>>> # between X and Y.
>>> regr.score(diabetes_X_test, diabetes_y_test)
0.5850753022690...
Shrinkage
Setting where this appears: few data points per dimension, with high-variance noise.
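The snippets below assume the two-point toy data from the scikit-learn tutorial and pylab imported as pl; a minimal setup sketch.
import numpy as np
import pylab as pl
from sklearn import linear_model

X = np.c_[.5, 1].T     # two observations, one feature
y = [.5, 1]
test = np.c_[0, 2].T   # points at which to draw the fitted line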
regr = linear_model.LinearRegression()
regr.fit(X, y)
pl.plot(test, regr.predict(test))
Solution:
Shrink the regression coefficients toward zero. The bias that ridge regression introduces is in fact a form of regularization. Fitting the noise so that the model fails to generalize to new data is called overfitting.
regr = linear_model.Ridge(alpha=.1)
pl.figure()
np.random.seed(0)
for _ in range(6):
    this_X = .1 * np.random.normal(size=(2, 1)) + X
    regr.fit(this_X, y)
    pl.plot(test, regr.predict(test))
    pl.scatter(this_X, y, s=3)
alphas = np.logspace(-4, -1, 6)
from __future__ import print_function
>>> print([regr.set_params(alpha=alpha).fit(diabetes_X_train, diabetes_y_train,).score(diabetes_X_test, diabetes_y_test) for alpha in alphas])
[0.5851110683883..., 0.5852073015444..., 0.5854677540698..., 0.5855512036503..., 0.5830717085554..., 0.57058999437...]
A classic bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower the variance.
Sparsity
Example: the diabetes dataset involves 11 dimensions, so it is hard to extract useful information by visualization alone, but it may be important to keep in mind that the data probably occupy a rather empty space.
Sparsity means keeping only the informative features and setting the coefficients of uninformative features to zero. Ridge regression shrinks coefficients but not all the way to zero; Lasso (least absolute shrinkage and selection operator) sets them to zero. Such methods are called sparse methods, and this is an application of Occam's razor: prefer simpler models.
regr = linear_model.Lasso()
scores = [regr.set_params(alpha=alpha).fit(diabetes_X_train, diabetes_y_train).score(diabetes_X_test, diabetes_y_test)for alpha in alphas]
best_alpha = alphas[scores.index(max(scores))]
regr.alpha = best_alpha
>>> regr.fit(diabetes_X_train, diabetes_y_train)
Lasso(alpha=0.025118864315095794, copy_X=True, fit_intercept=True,
max_iter=1000, normalize=False, positive=False, precompute=False,
random_state=None, selection='cyclic', tol=0.0001, warm_start=False)
>>> print(regr.coef_)
[ 0. -212.43764548 517.19478111 313.77959962 -160.8303982 -0.
-187.19554705 69.38229038 508.66011217 71.84239008]
Lasso -> uses a coordinate descent method, which is efficient on large datasets.
LassoLars -> uses the LARS algorithm, which is very efficient for problems in which the estimated weight vector is very sparse.
In LogisticRegression, C controls the amount of regularization: a large value of C results in less regularization. penalty="l2" gives Shrinkage (i.e. non-sparse coefficients), while penalty="l1" gives Sparsity.
Logistic Regression (LogisticRegression)
Example: for the iris task, linear regression is not the right approach, because it gives too much weight to data far from the decision boundary. A better approach is to fit a logistic function (which is closer to a step function); see the sketch below.
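A minimal sketch of fitting LogisticRegression on iris; C=1e5 (weak regularization) follows the tutorial example, and the note on C and penalty above applies here.
from sklearn import datasets, linear_model

iris = datasets.load_iris()
logistic = linear_model.LogisticRegression(C=1e5)
logistic.fit(iris.data, iris.target)
print(logistic.predict(iris.data[:3]))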
AdaBoost
The boosting method: the AdaBoost algorithm
Algorithm (AdaBoost)
Input: training data set T = {(x1, y1), (x2, y2), ..., (xN, yN)} with yi ∈ {-1, +1}, and a weak learning algorithm;
Output: the final classifier G(x).
(1) Initialize the weight distribution over the training data:
D1 = (w11, ..., w1i, ..., w1N), w1i = 1/N, i = 1, 2, ..., N
(2) For m = 1, 2, ..., M:
(a) Learn a base classifier Gm(x) from the training data weighted by the distribution Dm.
(b) Compute the classification error rate of Gm(x) on the weighted training data set:
em = P(Gm(xi) ≠ yi) = Σ_{i=1}^{N} wmi · I(Gm(xi) ≠ yi)
(c) Compute the coefficient of Gm(x):
αm = (1/2) · ln((1 − em) / em)
where the logarithm is the natural logarithm.
(d) Update the weight distribution over the training data:
Dm+1 = (wm+1,1, ..., wm+1,i, ..., wm+1,N)
wm+1,i = (wmi / Zm) · exp(−αm · yi · Gm(xi)), i = 1, 2, ..., N
Here Zm = Σ_{i=1}^{N} wmi · exp(−αm · yi · Gm(xi)) is a normalization factor, which makes Dm+1 a probability distribution.
(3) Build the linear combination of the base classifiers
f(x) = Σ_{m=1}^{M} αm · Gm(x)
and obtain the final classifier G(x) = sign(f(x)).
Remarks on the AdaBoost algorithm:
Step (1) assumes a uniform weight distribution over the training data, i.e. every training sample plays the same role when learning the first base classifier;
this assumption guarantees that the first round learns the base classifier G1(x) on the original data.
Step (2): AdaBoost learns base classifiers repeatedly; in each round m = 1, 2, ..., M it performs the following operations:
(a) Learn the base classifier Gm(x) from the training data weighted by the current distribution Dm.
(b) Compute the classification error rate of Gm(x) on the weighted training data set:
em = P(Gm(xi) ≠ yi) = Σ_{i: Gm(xi) ≠ yi} wmi
This shows how the weight distribution Dm relates to the classification error rate of the base classifier Gm(x).
(c) Compute the coefficient αm of Gm(x); αm expresses the importance of Gm(x) in the final classifier. When em < 0.5, αm > 0, and αm increases as em decreases, so base classifiers with smaller error rates play a larger role in the final classifier.
(d) Update the weight distribution of the training data to prepare for the next round.
The weights of samples misclassified by the base classifier Gm(x) are enlarged, while the weights of correctly classified samples are shrunk. Compared with each other, the weights of misclassified samples are magnified by a factor of exp(2αm) = (1 − em)/em, so misclassified samples play a larger role in the next round. Without changing the training data itself, AdaBoost keeps changing the weight distribution over the training data so that the data play different roles when learning each base classifier; this is one characteristic of AdaBoost.
Step (3): the linear combination f(x) realizes a weighted vote over the M base classifiers. The coefficient αm expresses the importance of the base classifier Gm(x); note that the αm do not sum to 1. The sign of f(x) decides the class of instance x, and the absolute value of f(x) expresses the confidence of the classification. Building the final classifier as a linear combination of base classifiers is another characteristic of AdaBoost.
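A minimal sketch of AdaBoost in scikit-learn, using decision stumps as base classifiers; the dataset and parameter values are illustrative only, and the base_estimator keyword follows older scikit-learn versions (later renamed to estimator).
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                         n_estimators=50)
clf.fit(iris.data, iris.target)
print(clf.score(iris.data, iris.target))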
Support Vector Machines (SVM)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0, :2]
y = y[y != 0]
n_sample = len(X)
np.random.seed(0)
order = np.random.permutation(n_sample)
X = X[order]
y = y[order].astype(np.float)
n_train = int(.9 * n_sample)
X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]
# fit the model
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)
    plt.figure(fig_num)
    plt.clf()
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired)
    # Circle out the test data
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, facecolors='none', zorder=10)
    plt.axis('tight')
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()
    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                levels=[-.5, 0, .5])
    plt.title(kernel)
plt.show()
Clustering
k-means
Hierarchical agglomerative clustering: Ward
Agglomerative - bottom-up
Divisive - top-down
import time
import numpy as np
import scipy as sp
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering
# Generate data
lena = sp.misc.lena()
# Downsample the image by a factor of 4
lena = lena[::2, ::2] + lena[1::2, ::2] + lena[::2, 1::2] + lena[1::2, 1::2]
X = np.reshape(lena, (-1, 1))
###############################################################################
# Define the structure A of the data. Pixels connected to their neighbors.
connectivity = grid_to_graph(*lena.shape)
###############################################################################
# Compute clustering
print("Compute structured hierarchical clustering...")
st = time.time()
n_clusters = 15 # number of regions
ward = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward',
                               connectivity=connectivity).fit(X)
label = np.reshape(ward.labels_, lena.shape)
print("Elapsed time: ", time.time() - st)
print("Number of pixels: ", label.size)
print("Number of clusters: ", np.unique(label).size)
Connectivity-constrained clustering
Feature agglomeration -> used to reduce the dimensionality of the data
We have already seen that sparsity can be used to mitigate the curse of dimensionality, i.e. an insufficient number of observations compared with the number of features. Another approach is to merge similar features together: feature agglomeration. It can be implemented by clustering in the feature direction, in other words, by clustering the transposed data.
import numpy as np
from sklearn import cluster, datasets
digits = datasets.load_digits()
images = digits.images
X = np.reshape(images, (len(images), -1))
connectivity = grid_to_graph(*images[0].shape)
agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
>>> agglo.fit(X)
FeatureAgglomeration(affinity='euclidean', compute_full_tree='auto',...
X_reduced = agglo.transform(X)
X_approx = agglo.inverse_transform(X_reduced)
>>> images_approx = np.reshape(X_approx, images.shape)
Decompositions: Dimensionality Reduction
PCA
The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat.
In other words: of the three dimensions, one is particularly flat and can almost be computed from the other two; PCA finds the directions in which the data are not flat.
>>> # Create a signal with only 2 useful dimensions
>>> x1 = np.random.normal(size=100)
>>> x2 = np.random.normal(size=100)
>>> x3 = x1 + x2
>>> X = np.c_[x1, x2, x3]
>>> from sklearn import decomposition
>>> pca = decomposition.PCA()
>>> pca.fit(X)
PCA(copy=True, n_components=None, whiten=False)
>>> print(pca.explained_variance_)
[ 2.18565811e+00 1.19346747e+00 8.43026679e-32]
>>> # As we can see, only the 2 first components are useful
>>> pca.n_components = 2
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape
(100, 2)
ICA-> Independent Component Analysis
ICA selects components so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals:
That is, ICA chooses components so that the distribution of their loadings carries the maximum amount of independent information, and it can recover non-Gaussian independent signals.
>>> # Generate sample data
>>> time = np.linspace(0, 10, 2000)
>>> s1 = np.sin(2 * time) # Signal 1 : sinusoidal signal
>>> s2 = np.sign(np.sin(3 * time)) # Signal 2 : square signal
>>> S = np.c_[s1, s2]
>>> S += 0.2 * np.random.normal(size=S.shape) # Add noise
>>> S /= S.std(axis=0) # Standardize data
>>> # Mix data
>>> A = np.array([[1, 1], [0.5, 2]]) # Mixing matrix
>>> X = np.dot(S, A.T) # Generate observations
>>> # Compute ICA
>>> ica = decomposition.FastICA()
>>> S_ = ica.fit_transform(X) # Get the estimated sources
>>> A_ = ica.mixing_.T
>>> np.allclose(X, np.dot(S_, A_) + ica.mean_)
True
Pipelining
Example: combining a transform model with a predict model.
The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
logistic = linear_model.LogisticRegression()
pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
# Plot the PCA spectrum
pca.fit(X_digits)
plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')
# Prediction
n_components = [20, 40, 64]
Cs = np.logspace(-4, 4, 3)
#Parameters of pipelines can be set using ‘__’ separated parameter names:
estimator = GridSearchCV(pipe, dict(pca__n_components=n_components,logistic__C=Cs))
estimator.fit(X_digits, y_digits)
plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
linestyle=':', label='n_components chosen')
plt.legend(prop=dict(size=12))
Type Casting
1. Unless otherwise specified, input is cast to float64
import numpy as np
from sklearn import random_projection
rng = np.random.RandomState(0)
X = rng.rand(10, 2000)
X = np.array(X, dtype='float32')
>>> X.dtype
dtype('float32')
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.dtype
dtype('float64')
2. Regression targets are cast to float64; classification targets are preserved
from sklearn import datasets
from sklearn.svm import SVC
iris = datasets.load_iris()
clf = SVC()
clf.fit(iris.data, iris.target)
>>> list(clf.predict(iris.data[:3]))
[0, 0, 0]
clf.fit(iris.data, iris.target_names[iris.target])
>>> list(clf.predict(iris.data[:3]))
['setosa', 'setosa', 'setosa']
Numpy
1. array vs list
Python's list is a built-in data type whose elements need not all have the same type, whereas all elements of a numpy array must have the same type (see the sketch after this list).
2.np.unique(iris_y)
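Small sketches of both points; the dtype printed for the array is platform dependent.
import numpy as np
from sklearn import datasets

mixed = [1, 'a', 3.0]        # a list may hold mixed types
arr = np.array([1, 2, 3])    # an array has a single dtype
print(type(mixed[1]), arr.dtype)

iris_y = datasets.load_iris().target
print(np.unique(iris_y))     # the distinct class labels, expected [0 1 2]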
Pandas
Example
Working With Text Data - 20 newsgroups dataset
Load data
from sklearn.datasets import fetch_20newsgroups
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
# print twenty_train.target_names
# print len(twenty_train.data)
# print len(twenty_train.filenames)
# print twenty_train.data[0].split('\n')[:3]
# print(twenty_train.target_names[twenty_train.target[0]])  # map the first document's target index to its category name
# print twenty_train.target[:10]
# print twenty_train.target_names
Extracting features - the bag-of-words representation
# X as a numpy array of type float32 would require 10000 x 100000 x 4 bytes = 4GB in RAM
# which is barely manageable on today’s computers.
# Fortunately, most values in X will be zeros since for a given document less than a couple
# thousands of distinct words will be used.
# For this reason we say that bags of words are typically high-dimensional sparse datasets.
# scipy.sparse matrices are data structures that do exactly this,
# and scikit-learn has built-in support for these structures.
# Text preprocessing, tokenizing and filtering of stopwords are included in
# sklearn.feature_extraction.text.CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)
The tf-idf model
# tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
# X_train_tf = tf_transformer.transform(X_train_counts)
# print X_train_tf.shape
# print X_train_tf[0]
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print(X_train_tfidf.shape)
Classifier
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))
pipeline
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf', MultinomialNB()), ])
text_clf = text_clf.fit(twenty_train.data, twenty_train.target)
Evaluation
import numpy as np
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
print(np.mean(predicted == twenty_test.target))
Other Classifiers
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, n_iter=5, random_state=42)), ])
_ = text_clf.fit(twenty_train.data, twenty_train.target)
predicted = text_clf.predict(docs_test)
print(np.mean(predicted == twenty_test.target))
Confusion matrix - detailed performance analysis
from sklearn import metrics
print(metrics.classification_report(twenty_test.target, predicted, target_names=twenty_test.target_names))
print(metrics.confusion_matrix(twenty_test.target, predicted))
Parameter tuning
GridSearchCV
from sklearn.grid_search import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],'tfidf__use_idf': (True, False),'clf__alpha': (1e-2, 1e-3), }
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=1)
gs_clf = gs_clf.fit(twenty_train.data[:400], twenty_train.target[:400])
print(twenty_train.target_names[gs_clf.predict(['God is love'])[0]])
best_parameters, score, _ = max(gs_clf.grid_scores_, key=lambda x: x[1])
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))
print(score)